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Graphical Fermat’s Principle and Triangle-Free Graph 

Estimation 

Junwei Lu* Han Liu' 


Abstract 

We consider the problem of estimating undirected triangle-free graphs of high di¬ 
mensional distributions. Triangle-free graphs form a rich graph family which allows 
arbitrary loopy structures but 3-cliques. For inferential tractability, we propose a graphi¬ 
cal Fermat’s principle to regularize the distribution family. Such principle enforces the 
existence of a distribution-dependent pseudo-metric such that any two nodes have a 
smaller distance than that of two other nodes who have a geodesic path include these 
two nodes. Guided by this principle, we show that a greedy strategy is able to recover 
the true graph. The resulting algorithm only requires a pairwise distance matrix as 
input and is computationally even more efficient than calculating the minimum spanning 
tree. We consider graph estimation problems under different settings, including discrete 
and nonparametric distribution families. Thorough numerical results are provided to 
illustrate the usefulness of the proposed method. 

Keyword: High dimensional graph estimation; Triangle-free graph; Graphical Fermat’s principle; 

Graphical model; Greedy algorithm. 


1 Introduction 

Graphical model provides a powerful tool to explore complex distributions. Let X = (A;,..., Xd) T 
be a d-dimensional random vector with distribution Px ■ We denote the graph of Px to be 
G = (V, E), where the vertex set V corresponds to the variables X \,..., Xj and the edge set E 
characterizes the conditional independence relationships between these variables. Specifically, two 
nodes A,; and A j are not connected if and only if they are conditionally independent given the other 
variables. To utilize graphical models, a fundamental problem is to estimate the graph based on 
observational data. 

In a graph estimation problem, we observe n samples from Px and aim to infer the structure of 
the graph G. Many existing graph estimation methods involve two steps: (1) graph metric estimation: 
estimating a pairwise “distance” matrix defined by some (pseudo) metric and (2) structure learning: 
applying a structure learning algorithm based on the estimated pairwise “distance” matrix to recover 
the graph structure. For example, for the Gaussian graphical model, the graph metric on an edge 
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(i,j) is defined as dij = where Q is the inverse covariance matrix. The metric estimation step 


estimates the inverse 

covariance matrix Meinshausen and Biihlmann 

2006: 

Banerjee et al. 

2008 

Rothman et al. 2008 

Friedman et al. 

2008 

d’Aspremont et al. 

2008 

Yuan 

2010 

Cai et al. 

2011 


Liu and Wang 


2012 


Ren et al. 


only if dij ^ 0. For t 


2014 


and the structure learning step adds an edge ( i,j ) if and 


;he Ising model, the graph metric for each edge is the absolute value of the 
interaction parameter on the edge. The metric estimation step can be conducted by sparse logistic 

and the structure learning step is determined by the sparsity 


Ravikumar et al. 


2010 


regression 

pattern of the estimated coefficients from the sparse logistic regression. A more general framework 

They assume the 


on generalized linear graphical models has been proposed by Yang et al 


2012 


nodewise-conditional distributions follow generalized linear models and the values of the parameter 
vector can be interpreted as graph metrics. Other more complex graphical models include the 

and transelliptical graphical 

model 


Liu et al. 


2012a , which consider the inverse of Kendall’s tau matrix as the graph metric 

assume the conditional means of 


Voorman et al. 


2014 


and estimate the graph by its sparsity. 

variables are additive models and propose a semiparametric method to estimate the graph. 

The above graph estimation methods are either parametric or semiparametric. To further relax 
the distributional assumption, we need to enforce more constraints on the estimated graphs. One 
popular const raint is to enforce the estimated graph to be a tree. Under this structural regularization, 
Mossel 2007 considers the information distance dij = — log | det Px, j | as graph metric for discrete 
distributions. Here Px tj is the joint probability matrix for Xi,Xj and it can be estimated based 
on the empirical distribution. In the structure learning step, methods solving minimum spanning 
tree (MST) like Kruskal’s algorithm 


Kruskal 


1956 


or Chow-Liu algorithm 


Chow and Liu. 


1968 


can be applied with the edge weights determined by the estimated graph metrics. Another example 


is nonparametric forest graphical models. Under the same tree structural regularization, Liu et al. 


2011 use mutual information as the graph metric and suggest to apply the maximum spanning tree 
algorithm as the second step for graph recovery. Assuming the existence of possible latent variables, 
Song et al. 2011 consider a nonparametric graph metric based on the pseudo-determinant of the 


covariance operator. They apply the spectral methods to estimate the graph metric and the recursive 

to estimate the graph structure. To relax the restrictive 
consider a locally tree-like graphical model. Such 


grouping algorithm Choi et al 
tree structure, 


2011 


Anandkumar and Valluvan 


2013 


model allows loopy structures whose girth is long enough such that the neighborhood of any node is 
a tree. The information distance is shown to be a proper graph metric if the model satisfies the 
correlation decay condition Georgii 2011 Mezard and Montanari 2009 Weitz 2005 . For graph 


estimation, they propose to use minimum spanning tree algorithm to learn the tree neighborhoods 
of each node and merge these local trees together. In addition to the tree or locally tree-like graph 

propose to learn 


Loh and Wainwright 

2013 

and 

Loh and Biihlmann 


2014 


the graph by estimating the inverse generalized covariance matrix for certain discrete distribution 
families. 

In this paper, we propose a new graph estimation method which allows more general graph 
structures and can handle both parametric and nonparametric graphical models under a unified 
regularization framework called graphical Fermat’s principle. The graphical Fermat’s principle 
regularizes the graphical models by assuming the existence of some pseudo-metric defined as a 
functional of the joint distribution Px such that the metric between any two nodes i, j is larger than 
the metric between nodes i',j' if both i', f lies in the geodesic connecting i,j. The corresponding 
metric is called Fermat metric. The graphical Fermat’s principle is a kind of variational principle 
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for pairwise conditional dependency on the graphical models. In particular, it characterizes the 
phenomenon that nodes with stronger dependency have shorter paths connecting them. Our graph 
learning method estimates the pairwise Fermat metrics in the metric estimation step. For the 
structure learning step, we propose a minimum triangle-free graph (MTG) estimation algorithm 
to recover the graph. Our method has three advantages over the existing methods. First, our 
method allows arbitrary graph structure without 3-cliques. Compared to existing algorithms which 
require the girth of a graph is long enough, we can handle more complicated graph structures 
as long as its girth is larger than 3. Second, the graphical Fermat’s principle holds for a large 
family of parametric and nonparametric graphical models. We show that the information distance 
dij = — log | det Px, • | for the locally tree-like graphical model, the negative mutual information 
—I(Xi,Xj) for the Gaussian tree model, and the nonparametric tree metric defined for latent tree 
model are all Fermat metrics. Third, our structure learning method MTG is computationally more 
efficient than MST. Within each iteration, MST checks whether a new edge forms a cycle but MTG 
only needs to check whether it forms a triangle. Moreover, let dij be the Fermat metric between 
nodes i,j. If there exists a sequence of r n such that the minimum metric gap between edges has 
min(jj) \dij — d t ij> \ > (log d)/r n and the Fermat metric estimator dij has an exponential 

concentration P (|dij — dij\ > e) < Ci exp(—C^r^e 2 ) for all i,j = 1,... ,d, we show that the graph 
can be consistently recovered using MTG. As two applications, we consider both discrete and 
nonparametric graphical models. For nonparametric models using the negative mutual information 
as the Fermat metric, we also propose a new robust rank-based mutual information estimator, which 
is applicable to density function with arbitrary support. 

1.1 Paper Organization 

The rest of the paper is organized as follows. In Section[2j we introduce the graphical Fermat’s 
principle. The relationship between the Fermat metric and other existing graph metrics is also 
discussed. In Sections^ we propose a minimum triangle-free graph (MTG) estimation algorithm 
and prove its consistency. In Section [4] we apply the proposed algorithm to both discrete and 
nonparametric graphical models and prove their theoretical properties. Section [ 5 ] illustrates the 
numerical performance of our method. 


2 Preliminaries and Notations 

Let X := (Xj,..., Xd) T be a d-dimensional random vector. If X is Markov to the graph G = (V, E), 
its joint density pa(x) bears the factorization 


Pg(x) = exp ( ^ 0 c (x c ) - A(0) 
VceC / 


( 2 . 1 ) 


where C is the set of cliques in G, x c is a vector indexed by the clique c, and A(9) is a probability 
density normalizer. The factorization in 2.1 reveals the topological structure of the graphical 


model. For example, the Gaussian Markov random held is a special family of graphical model with 
the density factorization 
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Pg(x) = exp ( - ^ f hj(xi - m)(xj - Hj) - - ^2 n kk (x k - p k ) 2 - ^log[(27r) d det(S)] ), (2.2) 
' ( i,j)eE Z k =l " ' 

where det(-) is the determinant, S is the covariance matrix and ft = £ -1 . The factorization in 
implies that (i,j) G E if and only if ftij ^ 0. The Ising model characterizing binary values 


j+T, — l} d follows the probability mass function 


Pg{x) = exp ( ^2 OijXjXj + ^ 4> k x k - A{9) j, 


k=1 


(2.3) 


for all x G {+1, —1} . The graphical structure can be recovered by the support of the interaction 
parameters {6ij}fj =1 . A forest model is a nonparametric graphical model whose graph is a forest 
F = (V, E) and its density Pf() can be written as 

P(Xi,X 


Mx) tel 


n^*)- 


(2.4) 


Here p(xi,Xj ) is the bivariate marginal density of X t and Xj, and p(x k ) is the univariate marginal 
density of X k . All the above models have some distribution functionals between every two nodes 
i ^ j G {1,..., d}: ftij for Gaussian graphical model, for Ising model, and the mutual information 

P{Xi,Xj) 


I{Xf, Xj) := / / p(xi,Xj) log 


dxi dxn 


(2.5) 


P(xi)p{xj ) 

for forest graphical model. They all encode the underlying graph of the model. To formalize these 
relationships, we propose the graphical Fermat’s principle, which provides a framework for a large 
variety of graph estimation problems. 

2.1 Graphical Fermat’s Principle and Fermat Metric 

In this subsection, we present the definition of the graphical Fermat’s principle and Fermat metric. 
More properties of the Fermat metric are provided later. We also discuss the relations between 
the Fermat metric and other existing graph metrics in the literature. Before rigorously define the 
graphical Fermat’s principle, we explain its motivation by reviewing the forest density estimator 
(FDE) proposed by 


Liu et al. 


2011 


For each pair of nodes i,j G V, FDE associates ( i,j ) with the 
mutual information I(Xp, Xj) as the weight and applies the maximum spanning tree algorithm to 
learn the graph. This is equivalent to set the graph metric to be dij = —I(Xi,Xj) for all i, j and find 
a minimum spanning tree (MST) based on {dij}ij^v- The reason why MST works for FDE is that 
more “important” edges with stronger dependency have smaller graph metric as dij = —I(Xp,Xj). 
Therefore, we tend to add edges with smaller graph metric as early as possible and MST matches 
this intuition. In the following, we denote the graph metric between any two vertices i,j G V as 
dij G MU{+oo|^j The graphical Fermat’s principle specifies that a good graph metric f° r 

structure learning should have the property that smaller dij implies that ( i,j ) is more possible to 
be a true edge. The following definition formalizes this intuition. 


1 One thing to note is that our definition of graph metric is quite general. It does not need to be nonnegative or do 
not have to satisfy triangle inequality. 
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Definition 2.1 (Graphical Fermat’s Principle). Suppose a random vector X 6 l d has a graphical 
density pc(x ) and G = ( V , E) is the associated graph. We say the path connecting i, j: 

(■ i,Ul ) ->■ (ui,u 2 ) (u 2 ,u 3 ) -^ . . . -t (u m _2,«m-l) (u m -l,j) 

is of length m if i 7 ^ u\ 7 ^ ... 7 ^ u m _i / j. Let Patfpj( to) be the set of edges consisting of the paths 
connecting i,j such that 


Pathjj(m) = {e € E | edge e is on a path connecting i,j with length equal or smaller than m}. 


The graph geodesic between i,j is the path with shortest length connecting i,j and we denote 
the length of the graph geodesic between i,j to be d(i,j). The graph geodesic is denoted as 
Path(i,j) := Pathjj(d(i, j)). A graph metric {dij}ijev is a Fermat metric if and only if for any 
i,j,u,v G V, the graph geodesics Path(i, j) and Path(«, v) connecting u,v and i. j satisfy 


Path(rt, v) Q Path(z,j) implies d v 


< d, 


v 


( 2 . 6 ) 


If there exists a graph metric {djjjijev satisfying 2.6 , we say that the graphical model pc(x) 
follows the graphical Fermat’s principle. 

Graphical Fermat’s principle is an analogue to Fermat’s principle in physics describing the 
behavior of light: it tends to take the path which can be traversed in the least time. In graphical 
model, the graphical Fermat’s principle assumes that the nodes connected by shorter paths tend to 
have stronger dependency. Indeed we can derive the following variational proposition of Fermat 
metric from 


2.6 


Proposition 2.2. (Variational Property of the Fermat Metric) If the graph metric is a 

Fermat metric, we have for any i.j G V that (i. j) (f E, 


dij > max d u 
(w,v)EPath(i, < 7 ‘) 


(2.7) 


Proof. For any (u,v) G Path('i, j) such that (u,v) / (i,j), we have Path(u, v) ^ Path(i, j). since for 
any path connecting u,v, we can construct a path connecting i , j by combining the Path(u, v) and 
the rest of Path(i, j). According to 2.6 , dij > d uv . Since (u, v) is arbitrary, 2.7 is proved. □ 


If we interpret the Fermat metric {dyjij ev as a pseudo distance, the variational property of 
the Fermat metric in Proposition! 2.2 |tells us that the distance between two nodes is larger than the 


geodesic path connecting them. Another interesting property of the Fermat metric is that 2.6 
invariant under strictly monotone transformation. In particular, as 


2.6 


is 

only relies on the inequality, 
a Fermat metric remains to be a Fermat metric after any strictly monotone transformation. This 
implies that if there exists one Fermat metric on the graphical model, we can derive infinite number 
of Fermat metrics by applying strictly monotone transformations. In the following, we compare the 

and give several concrete examples of the 


Fermat metric with the tree metric in Song et al. 
Fermat metrics. 


2011 


Definition 2.3 (Tree Metric Song et al. 


2011 


{dn}i,iev i s a ^ ree metric (also called information distance by 


Let T = (V, E) be a tree, the graph metric 

and 


2011 


) if for every node i,j G V, we have 


Erdos et al. 


1999 


Choi et al. 


d^ — 


^ duvi 

where Path(i,j) is the set of edges on the unique path connecting i,j. 


( 2 . 8 ) 
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Comparing Definition 2.1 and Definition 2.3 since there is only one path connecting two nodes 


on a tree, we can immediately derive that a tree metric {dij}i t j^v must be a Fermat metric. On the 
other hand, a Fermat metric is not necessarily a tree metric. First, Fermat metric can be defined on 
any kind of graphs while a tree metric can only be defined on a tree graphical model. Second, if we 
and 


compare 


2.7 


2.f 


, the tree metric requires additivity of metrics on a path while 


2.7 


is much 


weaker. Furthermore, if we apply certain monotone transform on a tree metric, it will remain to be 
a Fermat metric even if it could be no longer a tree metric. We list several examples of tree metrics 
as follows. According to the discussion above, they are also examples of Fermat’s metrics. 


Lake 


1994 


Example 2.4 (Discrete Distribution 
tree distribution, we define d tJ for every i. j that 

1 


). Suppose X = (Xi, ..., Xd) T follows the discrete 


1 


dij = -logdet (Pij) + - log det(Pji) + - logdet(P„-), 


(2.9) 


where Pij is the discrete joint probability matrix for Xj ,Xj and P lt . P,j :i are diagonal matrices with 
marginal probabilities of X* and Xj on their diagonals. 

Example 2.5 (Gaussian Markov Random Field). Let X = (Xi,... ,X^) T ~ Pt( x ) be a Gaussian 
distribution that is Markov to a tree T. We define the graph metric as the negative mutual 
information —/(X,.; Xj) = —1/2 • log(l — ), where p tJ is the correlation between Xj and Xj. 

Example 2.6 (Nonparametric Tree Graphical Model). A random vecotr (Xi, ... 1 X ( f) T Pt{x) 
follows a nonparametric tree distribution. Let 77 be a reproducing kernel Hilbert space (RKHS) and 
the corresponding kernel is K(x,y). By the Mercer theorem, there exists a feature map 4>(x) £ 77 
such that K(x, x') = (4>(x), cffpx'))^. We define the covariance operator Cjj := EyXj WX 0 ® HXj)\ 
for i,j £ V. Here < 8 > is a tensor product. For any /, g £ 77, / <8> g : 77 i —> 77 is an operator such that 
/ (g) g(h) = (g, h)uf for any /z £ 77. The pseudo-determinant of the operator C is the product of 
non-zero singular values of C and we denote |C|* = fX { cr(C). For any i. j £ V, the metric is defined 


as 


1 


dij = -^ log \CijCfjU + 7 log | CiiC-i 


1 


+ \ l °s\ CjjCjjl*. 


Proposition 2.7. The graph metrics djj in Example] 2.4j Example] 2.5] Example 
monotone transformations are all Fermat metrics. 


2.6 


( 2 . 10 ) 


and their strict 


Proof. Erdos et al. 


dij = - log |p. 


H 


ric. Similarly, Song et al 


1999 


prove that the graph metric in 


2.9 


for discrete tree model and 


for Gaussian tree model are both tree metrics and therefore are Fermat met- 
also prove that the pseduo-determinant is a tree metric. Since we 


2011 


have — I(Xf,Xj) = — l/2- log(l — pfj) = —1/2 • log(l — exp(—2 d^)) for every i, j £ V being a strictly 


monotone transformation of dij. Therefore —/(X,; Xj) is a Fermat metric for Gaussian graphical 
models Markov to trees. ffil 


2.1.1 Comparison to Correlation Decay 

In addition to the graphical Fermat’s principle, correlation decay is another regularity assumption 
that has been popularly used for inferring loopy graphs. It is first raised in statistical physics 























































and Valluvan 


2013 


applies it to the structure estimation of discrete latent graphical model by 
a local Chow-Liu grouping algorithm (LocalCLGrouping). In order to recover a graph with cycles, 
LocalCLGrouping constructs local subtrees of the graph by Chow-Liu grouping method and pieces 
them together. The algorithm is valid when the graph is locally tree-like, which is guaranteed by 
correlation decay. 

In this section, we discuss the relationship between the graphical Fermat’s principle and the 
correlation decay property. In particular, we show that the correlation decay property secures the 
existence of a valid Fermat metric. This allows us to construct concrete examples of Fermat metrics 
on graphical models even with loopy structures. 

Before introducing the definition of correlation decay, we start with some notations. Let Px\G 
be the discrete distribution that is Markov to G and Px a \f be the marginal distribution on the 
edge set A C E with the potentials on the edges E\F being zero. We denote the girth of G by g. 
We assume the random variable X{ takes values on a finite set X. Recall that the graph distance 
between two nodes is d(i,j ) and we can therefore induce the distance between two sets of nodes 
E\ , E'2 C E and d(E\,E 2 ) = min i & E 1 ,jeE 2 d(i,j). We denote Bi(i) := {j G V : dist(z,j) < 1} and 
dBi(i ) := {j G V : d(i,j ) = 1} is the nodes with graph distances equals to l. Let Fi{i\ G) := G(Bi(i )) 
be the subgraph induced by the nodes in F)(z; G). Denote \\P — Q\\i be the l\ norm of two discrete 
distribution such that ||P — Q ||i := YsgS l^( s ) — Q(s)|, where S is the state space. 


2013 


The discrete graphical 


Definition 2.8 (Correlation Decay Anandkumar and Valluvan 
model Px a \g that is Markov to G = {V, E) is said to have correlation decay if for all Z 6 N, i G V 
and A C Bi(i), 

W p x A \G ~ PxaIf^g)111 < C(d(A,dBi(i))), (2.11) 

where £ is a nonincreasing rate function characterizes the decay of correlation. 


The correlation decay in 


mode 


2.11 


implies that there is no long-range correlation in the graphical 


2013 


such that it behaves locally like a tree. The distance considered in Anandkumar and Valluvan 
is the information distance dij = — log | det(Tx, -)| for all i,j G V. As the graphical model 
Is locally tree-like, the local tree metric is defined as c?(i,j;tree) := — log | det Px t \tree(i,j) I > where 
tr ee(i,j) := G(Bi(i) U B t (j)). 

Assumption 2.9. The following assumptions are needed by 
the consistency of graph recovery of LocalCLGrouping. 

1. The marginal distributions of local tree Px, :l \tree(i,j) are nonsingular for all i,j G V and 

drnax 


Anandkumar and Valluvan 


2013 


for 


o < dmin < d(i,j; tree) < d max < oo, r/ := 


d'u 


2. The nonincreasing rate function £(•) satisfies 


v = mm 


0 5 c (I - 1 - 0 < jvp' where 

(d min , 0.5e- Wm -(e dmin - l), e -°- M -axO+2), ^ 
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Theorem 2.10. The discrete graphical model Px a \g satisfies the correlation decay in 2.11 
We define the gap between pairwise information distances dij = — log | det(Px, .)| as <5 max — 
max m=1 |d( m ) — d( m +i)|, where d( m ) is the m- th la rgest information distance, and d' max (l) := 

ld max — log(l — e Wmax £(3/ 2_z_1 )). Under Assumption 
{dij}i,jev is a Fermat metric. 

The proof of the above theorem is deferred to Appendix [A] A sufficient condition of correlation 


2.9 


if 5max > 2\X\e d '™^C,{g/‘2 -l- 1), then 


decay for Ising models in 2.3 can be explicitly established by the restrictions on the edge potentials 


of Ising models. See more details in Appendix|A.l 


3 Method of Graph Estimation 

In this section, we introduce the minimum triangle-free graph algorithm (MTG), which reconstructs 
the underlying graph with loops but no 3-cliques. Suppose that we already have an estimator for the 
Fermat metric {d*?}. J(£V - We take the \dij} i j^y as the input of our algorithm. Concrete examples 
of estimating {dylijev are shown in Sectioned 


3.1 Graph Estimation Algorithm 

The minimum triangle-free graph algorithm is similar to MST. If we denote T(V) to be the set of 
all trees with vertex set V and MST(U;d) be the minimum spanning tree with vertex set V and 
edge weights {djj}ijev> where 


MST(U; dij) := argmin \ , dij. 

TeT ^ WJeT 


The output MST( U;dj 7 ) is called a Chow-Liu tree (Since 
algorithm 


Kruskal 


Chow and Liu 


1968 


applied the Kruskal’s 
1956) to estimate a tree density). Within each iteration, the algorithm greedily 


adds the edge with smallest weight among the set of edges not yet visited if no cycle is generated. 
It is described in detail in Algorithm [T] 


Algorithm 1 Minimum Spanning Tree | Chow and Liu 


1968 


Input: Graph metric estimator d = {d t j}ij^y 
Initialize Eq = 0 
for k = 1 ,..., d 2 do 
(h, 3k) <- argmin 
Ek t Ek—i U (ffcjjfc) 

El <r- E t k _ 1 U {(ik,jk)} if E k -1 u {(ik,jk)} does not contain a cycle 

end for 

Output: The minimum spanning tree F d 2 = (V,E\ 2 ). 


Similar to MST, the intuition behind the proposed MTG algorithm is that the connected edges 
in the true graph generally have smaller graph metric, and an edge should not be added if it violates 
the graphical Fermat’s principle. The algorithm of MTG is simple: First, we sort all the edges 
according to their weights (from large to small); Second, we greedily add in these edges according 
to the sorted order. Whenever a new edge is added in, we need to make sure that no triangle is 





















formed. Otherwise, this edge should not be added in. The detailed procedure of MST is described in 
Algorithm [ 2 ] The only difference between MST and MTG is the mechanism of deleting edges. MST 
deletes edges when detecting a new cycle and MTG deletes edges when detecting a new triangle. 
This makes MTG more efficient than MST in computation since triangles are much easier to be 
detected. The other difference is that MST detects cycles in E £ defined in Algorithm 1 However, 
MTG deletes ( i k ,jk ) when a new triangle formed in E k instead of Ef, as indicated in Algorithm [ 2 ] 

Algorithm 2 Minimun Triangle-free Graph 

Input: Graph metric estimator d = {dijjjjev 
Initialize Eq = Efj = 0 
for k = 1 ,..., d 2 do 
(4 ,jfc) argmin 
Ek t E k —\ U (ikijk) 

if E k _ 1 U (ikijk) does not form a new triangle then 
Ef. t— E kl U (ikijk) 

else 

ElEl, 

end if 
end for 

Output: The minimum triangle-free graph G d 2 = (V,E^ 2 ). 


Algorithm 1 deletes the k -th edge (ikijk) when E , ^_ 1 U (ikijk) contains a cycle. Algorithm 2 
deletes (ikijk) when E k -i U (ikijk) contains a triangle. Since E t k _ 1 and E k -1 are different, without” 
any assumption, E k does not necessarily contain E k . However, the following theorem shows that if 
{dij}i,jev is a Fermat metric, E k is always a subset of Ef . 

Theorem 3.1. Let G k = (V,Eu 


, F k = (V, El) and G k = (V,Ef) be the graphs in the k -th iteration 
with the input d = {dy} i . &v , then F k is the subgraph of G k for 
every k = 0,1,... ,d — 1. The minimum spanning tree F^-i is therefore called the forest skeleton of 


of Algorithm 


and Algorithm 


minimum triangle-free graph Gd- 

-i - 

In particular, 

we 

have the following 

Greedy: 

0 

C 

G\ 

C 

g 2 

C • • 

■ • C 

G d -1 


U 


u 


u 



u 

Triangle-free: 

0 

C 

Gi 

c 

g 2 

C • • 

' • c 

G d . 1 


u 


u 


u 



u 

Chow-Liu: 

0 

c 

Fi 

c 

F 2 

C • • 

' • c 

F d - 1. 


(3.1) 


Remark 3.2. The proof is shown in Appendix [ b] From Theorem 3.1 it is easy to see why the 


graphical Fermat’s principle can only be applied to learning triangle-free graphs. Considering a 
3-clique graph, suppose two edges have been added. We cannot decide whether the third edge with 
the smallest metric should be added or not, since both cases do not violate the graphical Fermat’s 
principle. Therefore the Fermat metric cannot identify triangles in the graphical models. 


3.2 Graph Estimation Consistency 

In this section, we establish the graph estimation consistency of Algorithm [2]based on the graphical 
Fermat’s principle. We first consider the case that the input is the true Fermat metric. In the 
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following of paper, we assume that the Fermat metric always exists on the given graphical models 
and is known. 


Assumption 3.3 (Fermat Metric). For the graphi cal m odel pc ( x ), there exists a graph metric 
{dij}f 7=l being a Fermat metric defined in Defintion 


2.1 


Under this assumption, we obtain the exact recovery result of Algorithm [ 2 ] as follows. The proof 
is deferred to the Appendix [c] 


Lemma 3.4. Let the associated graph G = (V,E) of a graphical model pc( x ) be a triangle-free 
graph. If the input d = {d l j} l j^v of Algorithm^is the population Fermat metric, Algorithm)^] can 
exactly recover the true graph G. 


The output of Algorithm [ 2 ] only depends on the relative order of the Fermat metric on each 
edge. If a gap between two pairs of metrics djk and day is too small, it is hard to guarantee that 
the estimated Fermat metric on each edge follows the same order as the truth. However, for graph 
estimation, it is unnecessary to recover the relative order exactly. Since some changes of the relative 
order do not influence the output graph, we just need to ensure that those relative orders which 
change the output are recovered precisely. Define the crucial set C C E x E such that (e, e') E C if 
and only if flipping the relative order of d e and d e > will change the output graph of Algorithm [ 2 ] 
The following assumption gives the condition which guarantees the graph estimation consistency of 
MTG. 


Assumption 3.5. Let C be the crucial set. There exists a positive sequence of r n —> 00 as n —> 00 
such that the gap between the Fermat metric on C satisfies 

min I d e — d e t I > L n , where L n = Q ( \ — S ^ + °S n j ^ (3.2) 

(e,e')eC yV f n J 

where L n = fl(a n ) denotes that there exists a constant C such that L n > Ca n for sufficiently large 


Liu et al. 


2011 


3.5 


define a similar crucial set on MTG and require a similar assumption as 
is a generic notation related to the estimation rate of the metric 


The r„ in 3.2 


Assumption 

estimator. It will be specified in the next section for concrete examples. The following theorem 
characterizes how r n plays the role in terms of graph recovery. 


Theorem 3.6 (Grap h Re covery Consisten cy! 


satisfying Assumption 


3.5 


Under Assumption 


3.3 


has the following exponential concentration inequa. 


Suppose there exists a positive sequence of r n 
if the Fermat metric estimator d = 
ity related to r n for all e > L n in (3.2| 


d/j dij | 


> e) < Ci exp (—C* 2 r n e 2 ) , 


(3.3) 


where the constants C \, C 2 are independent to i, j. 
lim^oo P (G ^ G) =0. 


The estimated G from Algorithm [ 2 ] has 


Remark 3.7. We prove the results in Appendix 


D 


I 11 the theorem, the sequence {l/y / r„)}^ =1 
represents the rate of convergence for the Fermat metric and it varies with different graphical models 
and diverse metric estimators. Assumption) 3.5 [and L n in 


3.2 


In the next section, we will provide explicit rates of r n and L r 


quantify how big the gap should be. 
for concrete graphical models. 
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4 Graph Estimation Based on Fermat Metrics 


In this section, we apply MTG to learn both the discrete and nonparametric graphical models. The 
exponential concentration inequalities of Fermat metric estimators similar to 
that graph estimation consistency can be guaranteed according to Theorem 


3.3 


3.6 


are proved such 


4.1 Graph Estimation for Discrete Graphical Models 


Suppose 

Example 


A = (Xi,..., Xd) T is a discrete random vector whose output set A is finite. As shown in 
we can define an information distance dij = — log det(Pjj)+^ log det (Pa)+\ log det (Pjj) 
for any i,j G V. It has been proved in Proposition 2.7 that if the graph G of X is a tree, the 
information distance {dij}ij£v is a Fermat metric. 

Let Pa , Pjj and Pj 3 be the empirical marginal probability matrices and the information distance 
estimator is dij = — log det (Pij) + \ logdet(Pjj) + ^ log det (Pjj)- We have the following lemma on 


its estimation rate, whose proof is deferred to Appendix E 


Lemma 4.1. Suppose the marginal bivariate joint stochastic matrix Pij is nonsingular for every 


i,j G V. Furthermore, there exists a constant p o > 0 such that p 0 1 < | P, 3 \ < po for any i,j. We 


have, for all e > 4p 0 S//2 where s = |4|, 

(j exp (-dij) - exp(—djj)| > e ) < 2 S+2 ex P 


PQ S ne 2 

32 s 2 


(4.1) 


Define the new graph metric d[- = —exp (—dij) for any i,j G V. 


Since d'-- is a monotone 

L J 


transform of dij , it is still a Fermat metric. We now apply the MTG to the new Fermat metric 
estimato r d' i3 = —exp (—dij) for all i,j G V and obtain the estimated graph G. According to 
Theorem 3.6 we have the following graph estimation consistency result for discrete graphical model. 


2.9 


is a Fermat metric for the 


Theorem 4.2. Suppose the information distance {<%}?,jev i n 

discrete graphical model and there exists a constant po > 0 such that pg 1 < |Pjj \ < po for any i,j. 
The gap between different Fermat metrics satisfies 


min I exp (—d e ) — exp(— d e >)\ > L n , 
(e,e')sc 


(4.2) 


where L n = D ( y /(log d + log n)/n). We obtain the G from Algorithm 2 by inputting d\ - = 
— exp (—dij). If L n < 4 p 0 s ^ 2 , we have linin^oo P ^G / G^j =0. 


Proof. As the new graph me tric { d'- }i.j^v is still a Fermat metric and it satisfies Assumption 

gives us the co ncent ration inequality of the metric estimator' 


= n due to 4.2 


, Lemma 


4.1 


€ > 


3.5 

for 


—s/2 - 1 1 1 j 

4p 0 ' > L n required in 3.3 . By Theorem 3.6 we prove the graph estimation consistency. 


for 

all 

□ 


4.2 Graph Estimation for Nonparametric Graphical Models 

Let ®i,..., x n be n data points generated from a d-dimensional random vector X := (X \,..., Xd) T . 
We denote Xi := (x,i,... ,Xi n ) T . Suppose X is Markov to the graph G = (V,E) and the joint 
density is pc(x). In this section, we consider nonparametric graphical models with the negative 
mutual information as their Fermat metric. In addition, we propose a robust mutual information 
estimator based on rank statistics and show its statistical properties. 
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4.2.1 Kernel Mutual Information Estimator 


A naive way to estimate the mutual information is to apply its definition in 
plug-in estimator 


2.5 


and construct a 


I{Ph) ■= 


Ph(xi,x 2 ) log 


( Ph{xl,x 2 ) \ 
\Ph(xi)p h (x 2 )) 


dx 1 dx 2 , 


(4.3) 


where ph(xi,Xj) and Ph(x k ) are the bivariate and univariate distribution estimators either through 
the empirical distribution or the kernel density estimator. Specifically, the kernel density estimator 
of the bivariate and univariate density functions are 


Ph(xi,Xj) ■=-^Y, K 


nh 2 

for every i,j,u = L .. 
methods 


k =1 


Xi X{k 

h 


K 




k=1 


Xu X u k 

h 


(4.4) 


Liu et al. 
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d. Because the estimator in 4.3 needs double integration on R , existing 


generally assume the density function that its support must be on 
a unit square. However in practical applications, a large family of random variables have unbounded 
supports. In order to bridge the gap between the real world applications and theoretical analysis, we 
propose a robust estimator applicable for unbounded random variables. Our estimator is grounded 
on a key observation that the mutual information is equivalent to a negative copula entropy. In 
particular, the copula of ( Xi,Xj ) is defined as the joint cumulative distribution function: 


C(ui,uj) = P (Ui < Ui , Uj < Uj ) = F(Xi < (FW)- 1 ^),^ < (PW)' 1 ^)), 


(4.5) 


where F^(xe) = P (Xg < xg) is the cumulative distribution function of Xg, and Ug = F^pXg). 
The corresponding copula density is c(ui,Uj) = d 2 C/duiduj. It is known that (Ui, Uj) has uniform 


marginal distributions 
eliminated in ( U,Uj] 


Sklar 
and the 


1959 


The information of the marginal distribution of ( Xi,Xj ) is 


copula density characterizes the dependency between Xi and Xj. 
Moreover, there is a direct connection between the copula entropy and mutual information: 


I(Xi',Xj) = —H c (Xi,Xj) = // c(ui, uj) log c(ui,Uj)duiduj. 


(4.6) 


This is a key property for our robust mutual information estimator. From 


4.6 


we derive a novel 


approach to estimate the mutual information by estimating the copula density through the kernel 


density estimation based on the rank statistic ( Ui,Uj) := (Fn\Xi), Fn 1 (Xj)) where F^ 1 is the 
empirical marginal distribution function of F^. Consider a bivariate random vector (Xj,Xj) 
with joint marginal density p(xi, Xj). We denote the marginal distribution function of Xi,Xj as 
F®(x) ,F^\x). Fn and F^P are the corresponding empirical cumulative distribution functions of 
the marginals of p(xi,Xj) where 




nM 


1 71 

F n k) (x) = - V t{X k < x}, for k = 1,..., d. 

n ^' 


k=1 


Denote U k = F^ k \X k ), U k = F^ (X k ) and the samples for i = 1 ,... , n are dehned as 

Ui := (ua,... ,u in ) T = (F^pxn),. . . , F® (xin)) T , 

Ui := (un,... ,u in ) T = (Fp ) (xn),...,Fp ) (x in )) T . 
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Original Data 


Rank Statistic & 
Mirror Reflection 


Figure 1: The procedure of converting the data {x i}f =1 to the rank statistics {u ,;}” =1 and operating 
the mirror reflection. 


Let c(v,i,Uj) be the density of copula distribution C(v,i,Uj ) defined in 4.5 . According to 4.6 , in 
order to estimate the mutual information, it suffices to estimate the copula entropy. 


We apply the kernel density estimation to infer the copula density. To reduce the boundary bias, 


we use the “mirror reflection” kernel density estimator introduced in Gijbels and Mielniczuk 


Ch(ui, Uj ) :=-^ £2 


where {uf k ,uj k ) = (uik,u jk ), (u y £ ,u y ^) = (-u ik ,u jk ), = (u ik ,-u jk ), = 

(- u ik , -Ujk), {uf k ,uf k ) = (u ik , 2 - u jk ), (t 2 g\fi<f) = {-u ik , 2 - u jk ), (ulk^jk) = (2 - u ik ,u jk ), 

= (2 - u ik , -Uj k ), (ufk Mjk) = ( 2 - Ui k , 2 - u jk ). The kernel function K(-) is used to 
construct the muliplicative kernel K{ui,Uj) = K[uj)K(uj) and h is the corresponding bandwidth. 
To convert the data {ajj }" =1 into the rank statistics , we squeeze the unbounded random 

vector X into the unit square [0, l ] 2 as shown in Figur e|l| As there are fewer data points on the 
boundary than in the middle, the mirror reflection in 4.7 removes the boundary bias. Since all 


k =l i=\ 


v( 2 ) -( 2 ), 



1990 


(4.7) 


rank statistics locate on the grids, the kernel copula density estimator Ch is more robust than the 
classical kernel density estimator. 

To keep our estimated copula density away from zero and infinity, we truncate the Ch(ui,Uj) in 
as follows 


4.7 


C h (Ui,Uj) = 



c h {ui,Uj) < «i; 
k\ < c h (ui,uj) < k 2 ; 
C h (Ui,Uj) > H 2 - 


(4.8) 


The constants K\, n 2 will be determined later according to the density function. Hence we have the 
mutual information estimator: 

Iij(c h ) = -Hij(c h )= c h {ui,Uj)\ogc h {ui,Uj)duiduj, (4.9) 

JJ[ o,i ] 2 


for each pair i,j = 1,... ,d and obtain a d x d mutual information matrix M = [Iij(ch)\- 
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4.2.2 Mutual Information Concentration Inequality 

For any i,j = l,...,d, the marginal bivariate density function of (X,;, Xj) is p(xi, xj). Let c(ui, Uj) 
be the copula density of p(xi,Xj), and it satisfies the following assumptions. 

Assumption 4.3 (Density assumption). We assume the density c(v,i,Uj) belongs to a compact 
supported 2nd-order Holder class E K (2 , L) and is bounded away from zero and infinity. The copula 
density satisfies 

1. (Boundedness) There exist constants n \, n 2 such that 

0 < Ki < min c(ui,Uj) < max c(ui,u,j ) < u 2 < oo: 

(ui,uj) e[o,i ] 2 (ui,uj)e[o,i ] 2 


2. (2nd-order Holder class) For any (tq, Uj) T G [0, l] 2 , there exists a constant L such that, for any 
s,t> 0, if dc/du and dc/dv are the first and second partial derivative of the copula density 
c(u,v), then 


dc(ui,Uj) dc(v,i, Uj 

cyui + s, Uj + t) — c[Ui, Uj) --—— s --—— t 


du 


dv 


< T(s 2 + 1 2 ); 


3. (Boundary Assumption) The partial derivative of copula density c(ui,Uj) decreases to zero on 
the boundary, i.e., for any sequence {u[ n \u^}™ =1 G [0, l] 2 converges to some point on the 
boundary, we have 


lim 

n—>oo 


dc{uf^ ,u^) 
du 


lim 

n—»• oo 


dc(u\ n \u^) 

dv 


= 0 . 


We also need a kernel function K(-) satisfying the following assumptions: 

Assumption 4.4 (Kernel assumption). The kernel function satisfies 

1. (Symmetry) The kernel K(-) is a symmetric and continuous density function with bounded 
support [—1,1], 

2. (Lipschitz) For any U\,U 2 G [—1,1], the kernel K(u) has the Liphschitz continuity 

\K(ui) - K(u 2 )| < L k \ui - u 2 1 . 


For the mutual information estimator we proposed in 4.9 , we have the following exponential 


concentration inequality. The proof is deferred to Appendix 


Theorem 4.5. For a bivariate random vector ( Xj , Xj) with the copula density satisfying Assumption 

using kernel function K(- ) satisfying Assumption 


4.3 


4.4 


4.8 


The kernel copula density estimator c/,_ in 
Let /*■ = I(Xi] Xj) be the population mutual information between Xi,Xj and Iij(ch) he the 

If we choose the bandwidth h x (logn/n) 1 / 6 , then for sufficiently large 


estimator proposed in 
n. 


4.9 


sup P(| Iij(c h )-Itj\ 

(2 ,L) 

where k = max{ | log \, | log \} + 1. 


> e) < 2 exp [ — 


(n 2 logn) 1 / 3 f 
128 u?L 2 k 
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According to Chebyshev’s inequality, we can easily get the upper bound of the absolute error 


from Theorem 4.5 


Corollary 4.6. Under Assumptions 4.3 and 4.4. we have 


Liu et al. 


E (| Iijich) - /*|) = 0(n-V 3 log 1 / 6 ™). 

, the mutual information is estimated based on the classical 
Both the bivariate density functions and marginal 


2011 


2012 c 


2.5 


Remark 4.7. In 

kernel density estimators according to 
univariate functions should be estimated. The corresponding estimator is 

m) := //?*(*!,*,) log (444,) <**, tel. 


(4.10) 


The bivariate density function estimator ph(x 1 , 0 : 2 ) is obtained by mirror reflection kernel density 
estimator similar to 


4.7 


Y n 9 

Ph(xi,Xj) := 

k =1 e =1 


K 




W' 

ik 


K 


Xj — X 


{(■)' 

jk 


(4.11) 


where (x^\x^) = (x ik ,x jk ), {xf k ,xf^) = (- x ik ,x jk ), (x^\x^) = (x ik ,-x jk ), (x$,xf^) = 
(~x ik , -x jk ), = (x ik ,2- x jk ), (x^,xf k ) = (~x ik ,2- x jk ), (x\l\x$) = (2 - x ik ,x jk ), 


„( 8 ) J8h 


„( 9 ) 


as m 


4.8 


The univariate density estimator is derived following the one dimensio n version o f 
Here we choose the bandwidth h x n -1 / 4 for bivariate and univariate densities in 
Then the entropy estimator for each node t is obtained by 


a version of 

4.11 

Liu et al. 

2012 c 


H p (xi) = - JPh(xi) log p h (xe)dx e . 


(4.12) 


If we compare I(ph) in 4.10 with the copula-based estimator /(c/,) in 4.9 , for I(ph ) we need to 
estimate both the bivariate density ph(xi,Xj) and univariate densities p k {xi), ph(xj). For I(ch), 
we only need to estimate bivariate copula density Ch ('iM , Uj ). Our estimator is more efficient in 
computation. 


Remark 4.8. Using the bandwidth h 
has been proved by 


n 4 / 4 in 


Liu et al. 


2012 c 


[0, l] 2 and p € E K (2, L), we have 


4.11 


, a corresponding concentration inequality 
for I(ph)- If the density p(xi, Xj) is supported on a unit square 


sup F(\I(p h ) - I (p)\ > e) < 6exp (■- 
P es K ( 2 ,L) v 7 V 


ne 


324k 2 ) 


(4.13) 


Comparing 3.3 with 


4.13 


, our estimator has a slower rate of convergence. However, our 
estimator is robust t o outliers and doe s not require the density function has a compact support on 
[0, l] 2 as required by Liu et al. 2012c . 
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4.2.3 Graph Estimation Consistency 


In this section, we use the negative mutual information as the Fermat metric, and establish the graph 
estimation consistency of Algorithm [ 2 ] More specifically, we use the copula mutual information 
to estimate the Fermat metric and plug it into Algorithm [ 2 ] We call this method as 


4.9 


estimator 

Kernel Information-Theoretical Estimator (KITE). The following theorem shows the consistency of 

KITE. 

Theorem 4.9. Suppose the graph metric dij = —I(Xi ; Xj) is a Fermat metric for the nonparametric 
graphical model. Under Assumptions|4.3j|4.4j and the assumption on the gap of the crucial set: 


min \d e — d e i\ > L r 
(e,e')£C 


where = fl 


/ log d + log re 
(re 2 logn) 1 / 3 


(4.14) 


if we choose bandwidth h x (log n/n) l ^ b for the mutual information estimator Iijfch) in 4.9 , then 
the estimated G derived from Algorithm 2 with d t j = —Iij(ch ) as input has lim n _ ) . oc 




4.5 


and 4.14 


Proof. Combining Theorem 
graph estimation consistency follows from Theorem 3.6 


which satisfies Assumption 


3.5 


G ± Gj = 0. 
for r n = (re 2 log re) 1 / 3 , the 

bis 


4.2.4 Skeleton Forest Density Estimation 

Let p{x) be the true density that generates the data. I 11 this section, our goal is to estimate the 
skeleton forest density q* defined as 


q* = argminD(p||g), 
g&'Pd, 

where D(p\\q) is the Kullback-Leibler divergence between p and q, and Vd is the family of distributions 

we can use Algorithm 


satisfying the tree factorization 


2.4 


4.2.3 


Under the assumptions in Section 
[2]to estimate a graph G with loops. However, it is hard to estimate the density p due to the curse 
of dimensionality. 

In this section, we present an algorithm that estimates a skeleton forest density q* in the form 
of (2.4p. Such a skeleton forest density provides the best tree approximation to the true density 

propose to first estimate the forest graph 


s 

x). To 


Liu et al. 


2012 c 


2.4 


p(x). To estimate the skeleton forest q* 

structure of q* by MST and plug in the bivariate and univariate kernel density estimators into 
to estimate q*. However, their method relies on the assumption that q* has a bounded support. In 
this, we propose an estimator of the skeleton forest densities with unbounded supports. 

Let F* be the graph of the forest density q* and wg call it the skeleton forest. We estimate 


with dij = -Iij(c h ) for i,j = 1,,d as 


F* by F = ( V,E ), which is the output of Algorithm 
the input. 

the standard kernel density estimator in 

density estimator ph 2 (xj,x k ) and the univariate density estimator Ph 2 {x 3 
nodes j, k are estimated by choosing /12 x re -1 / 6 for the kernel density estimator in 


4.4 


for all i E S. We choose h\ 


Denote S as the set of vertices isolated from others in the forest F. Let p} M (xg) be 

The bivariate 
for these nonisolated 
Then, we 


obtain the skeleton forest density estimator 


4.4 




x = 


n 

U,k)eE 


p h 2 {Xj,X k ) 
Ph 2 {Xj)p h2 (x k ) 


■ n Ph 1 (f. 


il£S 


n Ph 2 (xe). 
e&v\s 


(4.15) 


16 























The following theorem presents the estimation rate of the skeleton forest density estimator 
whose proof is provided in Appendix [G] 


4.15 


Theorem 4.10 (Density Estimation Consistency). Under Assumptions 

n -r/5 


4.15 


4.3 


and 4.4 


if 


PP 


m 


is obtained by choosing hi 


h -2 x n 1//6 and log d/(n 2 logn) 1 / 3 = o(l), there exists a 


constant C such that 


+ 


d — s 


^\\Pf 9 \\ Ll ~ C \ n 2/3 1 n 4/5 ’ 

where for any function /(x), its L\ norm is defined as ||/||li = / \f{x)\dx. 


Remark 4.11. The L\ rate of convergence in Theorem 4.10 for density functions with arbitrary 
supports matches the one for bounded support density functions in Liu et al. 2012c with a 
suboptimal scaling logd/(n 2 logn) 1 / 3 = o(l). 


5 Experimental Results 


In this section, we compare the numerical performance of KITE with other methods. We first 
introduce a pruning method to regularize the size of estimated graphs. Next, we compare KITE 
with three other methods: FDE, GLASSO and refit GLASSO. FDE Liu et al 


Chow-Liu algorithm on the kernel density estimator in 4.9 . GLASSO Tibshirani 


2011 


1996 


applies the 
estimates 


the inverse covariance matrix under the Gaussian graphical model through i\ regularization, and the 
refit GLASSO is a two-step method based on GLASSO: (1) obtaining an inverse covariance matrix 
by GLASSO and (2) refitting the model without t\ regularization on the estimated support. At last, 
we apply these methods to a genomic dataset. 


5.1 Graph Pruning 

We introduce a graph pruning procedure to avoid overfitting. We divide the dataset V = {x \,..., x n ) 
into two parts: the training data T>\ with size ni and the held-out data X >2 with size n. 2 - We first 
use V\ to obtain mutual information estimators / n , (X,/. Xj) for all i.j £ V by 4.9 and the graph 
estimator G ni . On the held-out data T> 2 , we also estimate the mutual information I n 2 (Xf, Xj) for 

More specifically, we prune 


4.12 


any i,j 6 V and estimate the marginal entropy by H n2 (Xk ) in 
the graph estimator G ni by estimating the skeletion forest density. Let FAy be the forest generated 
in the &:-th iteration of Algorithm 111 and we choose the estimated k to be 


k = argrnax 




I n 3 (Xi,Xj)- 


y. 


Hn 2 (X £ ), 


(5.1) 


where E{F^ 
that the maximizer of 


is the e dge s et of and V(F, 


5.1 


change for some iterations. In general, k in 


n] 1 ) is the set of non-isolated vertices of . Note 
may not be unique since the value of the objective function may not 
is a set of maximizers and we choose k = max k as 


5.1 


( k ) : 


the iteration step to prune. We choose the graph G ni 
pruned graph estimator. 


in the k -th iteration of Algorithm 


as the 
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5.2 Synthetic Data 


We consider four types of graph construction: hub, constellation, band and cluster to generate 
inverse covariance matrices X -1 . An illustration of these four graph patterns is shown in Figure 
[ 2 ] and Figure [3j All patterns are composed of several small connected components with special 
structures. The hub graph can be decomposed into stars. In each star, one node is chosen as the 
hub and all the others are connected to the hub. The constellation graph consists of subgraphs 
with at most one cycle. The subgraphs of band graph are bands such that each pair of nodes with 
coordinates i , j in one band is connected if \i — j\ < 2. The cluster graph is partitioned into several 
disconnected Erdos-Renyi random subgraphs with probability parameter 0.2. The support of 5R 1 is 
determined by these graph structures. The off-diagonal entries on the support of S -1 are uniformly 
generated from [—30,10]. The diagonal entries are chosen such that the smallest eigenvalue of S is 
1 . 


Based on the covariance matrix X, we generate n\ = ri 2 = 100 samples of the d-dimensional 
Gaussian data xi,...,x n ~ N(n, X) with d = 100, mean vector p = (0.5,..., 0.5). To test 
the robustness of esti mators, we distort th e data by applying the Box-Cox transformation yik = 
(sign(xj/ c )|:Cjfc| l/ — 1 )/v Box and Cox 1964 for i = 1,..., d and k = 1,... ,n. In the simulation, we 
choose v = 2.5 and obtain distorted Gaussian data y 1 ,..., y n . The third nonparanormal dataset 
is generated by letting z^ ) for i = 1,..., d and k = 1,..., n, where Fi(-) is the marginal 

distribution function of X t for i = 1 This transformation keeps the graph structure of 

..., Xd) while the transformed data no longer follow the Gaussian distribution Liu et al. 


(X, 


2009 


4.9 


and apply MST 


When applying FDE, we use the mutual information estimator I(ph) in 
to obtain the estimated forest F. 

The log-likelihood function on the held-out dataset V >2 is applied to compare the performance of 
density estimators by KITE, FDE, GLASSO and refit GLASSO. For KITE with the estimated skeleton 
forest F = (V. E ) and the isolated nodes S , the log-likelihood function is 


^ kite ~ nn X! log ( n 


p h2 {x\ s \x^) 


n 2 


s&v 2 


p h2 (xhp h2 (xn 


(«)'! 


■ (j.fc)G E 


n^(^)- n pi^ x > 


(s)^ 


l(zS 


£ev\s 


and the held-out log-likelihood function of FDE is 

1 


f H e = -E lo ef n ^ 




For GLASSO and refit GLASSO, let fi, ni ,Fl ni be the estimated mean and inverse covariance 
matrix using V 1 , the held-out log-likelihood function is 

4 a uss = — X | 2^ (S) _ ^' n ^ T ^ n ^ X< ' S ' > ~ R"') + 2 log ^(27 t)^ j ’ 

We also tune the regularization parameter of GLASSO using the log-likelihood function above on 
the held-out data. 

The graph estimation results for various graph patterns are shown in Figure [ 2 ] and Figure [ 3 ] We 
observe that GLASSO tends to over-select edges between two disconnected subgraphs. FDE tends 
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Figure 2: True and estimated graph structure for hub and forest graph models. For each graph 
pattern, the true graph (first row) is compared with the estimated graphs by GLASSO (second row), 
FDE (third row) and KITE (forth row) on Gaussian data (first column), distorted Gaussian data 
(second column) and nonparanormal data (third column). 


to achieve a graph estimator sparser than the true graph, because it enforces the forest structure. 
On the one hand, this helps FDE to discover the connected structures, but on the other hand, the 
forest structure makes the FDE graph estimator have many false negative edges. KITE balances the 
performance of detecting the connected structures and exploring the structure for each subgraph. 
For the distorted datasets, simulations demonstrate that KITE is more robust than GLASSO and 
FDE. 

From the values of held-out log-likelihood functions shown in Figure [4] the refit GLASSO is 
better than the other three methods on the Gaussian datasets because it uses the correct model. 
For the nonparanormal distribution data, although neither the Gaussian graphical model nor the 
forest density is correct, KITE and FDE have larger held-out log-likelihood function values than 
GLASSO and the refit GLASSO. This observation implies that the nonparametric forest density 
family have better approximation than the Gaussian graphical model. We see that KITE has similar 
performance on Gaussian dataset as FDE but outperforms FDE on the nonparanornral datasets. 
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Figure 3: The true and estimated graph structures for band and cluster graph models. For each 
graph pattern, true graph (first row) is compared with estimated graphs by GLASSO (second row), 
FDE (third row) and KITE (forth row) on Gaussian data (first column), distorted Gaussian data 
(second column) and nonparanormal data (third column). 


5.3 Gene Network 


We apply the graph estimation methods on the genomic dataset in 


Wille et al. 


2004 


The dataset 


consists of gene expression arrays for Arabidopsis thaliana from n = 118 samples. We focus on 
d = 39 genes: 16 in the cytoplasm from the mevalonate (MVA) pathway, 18 in the chloroplast 
from the plastidial (MEP) pathway, 5 in the mitochondria. We first take a log-transformation and 
standardize the original gene expression level data: KITE, FDE and GLASSO methods are then 
applied to the preprocessed data to estimate the gene network. 

The estimated gene networks are demonstrated in Figure[5j Even though these genes are located 
in different pathways, all the estimated graphs from these three methods are connected. This shows 
that there are cross-pathway links which is consistent to the conclusion in 


Wille et al. 


2004 


. We 

can see that the gene network derived from KITE both reveals the cycle structures and has a more 
concise edge selection than GLASSO. 
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Figure 4: The held-out log-likelihood functions for GLASSO (blue circle), refit GLASSO (red star), FDE 
(green circle dash line) and KITE (black step function) on the Gaussian (left) and nonparanormal (right) 
datasets under the hub, constellation, band and cluster graph patterns. The red dash vertical line marks the 
true graph size. 


6 Conclusion 

We propose the graphical Fermat’s principle as a new regularization on graphical models. The con¬ 
sistency of density estimation and graph estimation are proved under the exponential concentration 
inequality of the Fermat metric estimator. Concrete examples on the discrete and nonparametric 
graphical models are discussed. For nonparametric graph estimation, we also propose a robust 
copula-based mutual information estimator. Numerical experiments show that our estimator can 
robustly estimate the loopy graphs even on the contaminated data and outperform many existing 
graph estimators. 


A Proof of Theorem 2.10 


We first prove that the local tree metric d tre e(lO = d(i,j; tree) = — log | det P x , ,|treep,-/~)l is a Fermat 


metric. According to the Fact 2 in the supplemental materials of Anandkumar and Valluvan 


2013 


d tree (10 forms a tree metric on the subtree of G. Now we verify 2.6 is true for d tre e(!0- 
For any i,j £ V , since the geodesic distance < [fj) we have G{Bi(i) U Bi{j )) is a tree and 
Path+ 1) C tree(i,j). Therefore, if two vertices u,v with Path UtV (m uv ) C Path ij(rriij), we 
have Path u , v {m uv ) C tree(i, j) as well. As the distances dt ree (-Bz(0 U Bi(j )) forms a tree metric, due 
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Figure 5: True and estimated gene network structure by GLASSO (top left), FDE (top middle) and 
KITE (top right). 


to the additivity of the tree metric 2.8 , we have 


d(i,j; tree) = 


E 


d(s, t; tree) < 


E 


d(s, t; tree) = d(u, v; tree). 


(s ,t ) GPathi j {rriij ) 


( s,t)eP&th u ,v(m U v ) 


Similarly, combining the facts that Path ij(rriij + 1) C G(Bi(i) U Bi(j)) and dt r ee (Bi(i) U Bi(j )) 
forms a tree metric, for any (u,v) E Path,; j (rriij + 1), it can be derived that d(i,j; tree) = 
E(«,t)ePathi i:; -(T»H,-+i) ^5 tree) > d(u,v, tree) which verifies [2/7 . 


Now we turns to prove the information distance dij = — log | det(P > x i ,) | is a Fermat metric. If 

Js satisfied, from Proposition 2 in the supplemental materials of 


the correlation decay in 


Anandkumar and Valluvan 
the condition that 5 max > 2 


2.11 

1 


2013 


, we have | dij — d(i, j;tree)| < \A\e d ' m ^ l ' ) C(g/2 — l — 1). Under 
,max ^ ^ X\e d ™^ l \(g/2 — l — 1), {d t j} l ,j ( zy. the information distances preserve 
the relative order of the local tree metric d(i,j; tree). Because the graphical Fermat’s principle is 
invariant under monotone transformations, we have is a Fermat metric. 

A.l Correlation Decay under Ising Model 

In this section, we discuss the connection between the correlation decay and Fermat principle under 
Ising model. 


Assumption A.l. The following assumptions are considered by Anandkumar and Valluvan 


2013 


for the consistency of graph recovery of LocalCLGrouping for Ising models Pq. 

1. The edge potentials satisfy 8 m in < \&i,j\ < 8 m ax , for any i,j E V. 

2. We define the attractive counterpart Pq = exp j)eE \@ij\ x i x j + Yliev 10*1 x i ~ A(|0|)^ 
and $ max := max^gy atanh(E(Aj)), where the expectation E is taken over Pq- Let Px\ , = 
exp (9 max xiX2 + 0max("U + x i) ~ A(0i)) and P x’. 2 = ex P (^min®i ®2 - A(0 2 )) where A(G 1 ),A(G 2 ) 
are the normalizers. We define d min = — log | det(Pxi, 2 )l> %ax — — log | det(Pj^ )| and 


V • ^max/^min* Let A r 


•"max 

assume 


be the maximum node degree and a := A ri 
a 9 / 2 

= o(l), 


tanh 8 n 


< 1. We 


V(V+ 1)+2 


8 


















where o(l) is with respect to the number of nodes d. 


Therefore, we can derive the following result characterizing when the information distance dij in 
Ising models will become a Fermat metric. 


Corollary A. 2. Let pc( x ) be the density of Ising model in 2.3 . Under Assumption A.l if 
<5 max > 2| X\e d ™^ l \{g/2 — l — 1), the information distance {d t j} t ,j^y is a Fermat metric. 


Proof. In the proof of Theorem 1 in 
Assumption A.l due to t 


Anandkumar and Valluvan 


2013 


according to Theorem 


2.10 


it is shown that under 
re property of Ising models, Assumption 2.9 is satisfied. Therefore, 
d^ = — log | det(Pxi ,)| is a Fermat metric. 


□ 


B Proof of Theorem 3.1 


Proof. If the k -th edge (ik, jk) is not added to a new triangle must form in Ek and we denote the 
third vertex of the triangle as h. Due to the algorithm, we have (h,ik) £ Pfc-i and (h,jk) £ E^_\. 
We claim that there must exists a path connecting h and ik in Fk-\. Suppose ( h , ik ) is considered in 
the t-th iteration of the algorithm and as ( h , i *,) £ Ek- i, we have l < k. If ( h , ik) £ E £, ( h , ik) is the 
path in the claim. If ( h,ik) (f E ji, a cycle must be formed in E\ in the l -th iteration when adding 
(h, ik)- This implies that there is a path connecting h, ik in Fg. Similarly, it can also be derived that 
there is a path connecting ( h,jk ) in Fk-i as well. Therefore, the edge ( ik,jk) will form a new cycle 
if it is added to Fk-i and we have ( ik,jk) ^ E{. We conclude that E £ C Ek for every k = 1,..., d, 

' ’ □ 


and Algorithm [ 2 ] generates three sequences of graphs with the filtration structure in 


3.1 


c 


Proof of Lemma 


3.4 


Proof. We use the same notation as in Algorithm[2]and denote the fermat metric of any edge e £ E 
as d e . We will prove inductively that for every /c-th iteration, the ek = (ik,jk ) will be added to Ek if 
and only if is the edge in true graph G. Let e\ = (i\,j\) be the edge with smallest graph metric, 
which is considered in the first iteration, e\ must be a true edge, otherwise according to 2.6 there 
must exists e' £ Path(*i,ji) and d e i < d ei which violates the fact that d ei is the smallest. 

For the k- th iteration and the edge ek £ E. the induction hypothesis is Ef_ ] c E. Suppose 
Ek -1 U {ek} forms a new triangle, and we denote the other two edges of the triangle are ( ik,u) 
and (u,jk). As G is triangle-free, there must be at least one of these two edges, say (u, jk) ^ E 
such that Path(?'fc,u) / 0 (otherwise contradicting the fact that dt u ,j k ) < d ek ). 

Let path(ifc,u) be a path connecting u,i k , we construct a path (jk,u) = path(ifc,n) U {(ik, jk)} C 
Path(u,jfc) while > d(i h ,j k ) which contradicts 2.7 . Therefore, if £ E, it will not form a 


new triangle and hence be added into P( fc n. 

If e-k E, let the path connecting ik,jk be path (ik,jk) C E. Find a node v £ V such that 
(v,ik) £ Path (ik,jk), and it is easy to see that Path (v,jk) § Path^,^). According to i 
have < d ek and d( v j k \ < d ek . This implies that the edges (v,ik), (v,3k) £ Ek -1 and ek will 

not be added to E\, as a new triangle is formed in Ek = E k ~ 1 U e^. 

In conclusion, previous induction reasoning demonstrate that G,p = G. □ 


2.6 


we 
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D Proof of Theorem |3.6| 

Proof. Since the output of Algorithm [ 2 ] only depends on the relative order of mutual information, 
we have 


G / G* ) < P (sign(d e — d e /) 7 ^ sign(d e — d e /) for some e, e' G E 
d e — d e >) ■ ( d e — d e i ) < 0 for some e, e' G E 

< — max P ((d e — d e ') ■ (d e — d e >) <0 for some e, e € E 

2 (e,e')g C \ 

Here the first inequality is because the relative orders of edges are same if the estimated graph is 
same as the truth. The last inequality is according to union bound. According to Assumption ! 3.51 
we have min( ee /) e c |d e — d e >\ > 2 L n , where L n = 0(y / (logd + logn)/r n ). Thus 


max P (( d e — d e i) ■ (d e — d e A for some e, e' E E 
(e,e')£C 


< maxi 
e&E 


(| d e - d e | > 2L n ^ < Ci exp (- C 2 L 2 n r n ) , 


and the last inequality follows from 3.3 . Combining the above two inequalities, we can derive that 


G / G*^j < — ^ max P d e — d e >) ■ (d e — d e i) <0 for some e, e' € E 

< d 4 max P ( I d e — d e I > 2 L 

eS E V 

< Cid 4 exp (- C 2 L 2 n r n ) = 0 (exp (41ogd - C 2 L 2 n r n ))) = o(l). (D.l) 

Therefore, we complete the proof of the theorem. □ 


E Proof of Lemma |4.1| 

In order to show the concentration of the Fermat metric d'- = — exp (—d^) for d t j = — log | Pjj \ + 

| log \Pa\ + ^ log | Pjj |. We first bound the concentration determinant of the empirical stochastic 
matrix by the following lemma. 

Lemma E.l. Anandkumar and Valluvan 
based on the data with sample size n, and we have 


2013 


Let P G R sxs be the empirical stochastic matrix 


det (P) — detP| > e\ < 2 


< 2 s exp ^- 7^2 J ■ 

' we prove the concentration inequality 4.1 . First, according to the 
dij, we can derive that 


Now 


definition of d^ and 


| exp (—dij) - exp(-dij)| > e) = P(| det(P ii 1/2 P ij P jj 1/2 )-det(P ii 1/2 P ij P jj 1/2 )\ > e) 


< T1+T2+T3, 
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where the last decomposition is due to union bound such that 


?i = P (J det( 4 1 / 2 ) _ det(/V 1/z )| • | det(i^) det(/V 1/z )l > e/3) ; 
T 2 = P (| det (P7 1/2 ) - det(pr 1/2 )| • | det(/V 1/2 ) det (4) I > e/s) ; 
T 3 = P (| det(4) - det(Pjj)| • | det(/V 1/2 ) det(/V 1/2 )| > e/3^ 

By triangle inequality and the fact that min,; je y ||Py ||min > Po, we have 




5-l/2x 


Ti + r 2 + r 3 < 


2s \ / g 

det(Pjj) - det(P»i)| > +P M det(P J y) - det(P,y)| > y 

+ P (| det(Pji) - det(Pjj)| > + P (j det (4) - det(P ij )| > y 

Po e 


+ P (J det(Pjj) — det (Pjj) | > ^ 

Now we transfer the concentration of metric estimators to the concentration of the determinant of 


probability matrix. Therefore, according to Lemma E.l. we have 


exp (-dij) - exp(-dij)\ > e) < Ti + T 2 + T 3 


< 3 • 2- exp (-«£=') < exp 


32 s 2 


8s 2 


32s 2 


where the last inequality is derived from e > 32 p 0 s . 


F Proof of Theorem 4.5 


In this section, we will analysis the statistical properties of kernel copula density estimator and give 
a proof of Theorem 4.5. The error |P(cp) — /(c) | could be decomposed as two parts 


I I(c h ) - /(c) I < I I(c h ) - EI(c h ) I + |E I(c h ) - /(c) I. 


The first part on the right side of the inequality is bias term and the second part is variance term. 
We are going to analyze these two parts separately and show that 


Bias Term : 


Variance Term : 


sup |E/ (dh) — / (c)| <C\ 

cGS k (2,Z/) 


logn\ 173 
n ) ’ 


sup 

c£S k ( 2,L) 




where C 1 is a constant only depend on the properties of copula density and kernel functions and 
the bandwidth is set to be h x (log n/n) 1//6 . 
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F.l Bias Term Analysis 

For the purpose of estimating the bias of mutual information, we first estimate the integrated 
squared bias of kernel copula density estimator in the following lemma. 

Lemma F.l. Under Assumption |4.3| and |4.4j we have for some constant c, 
sup / (E Ch(ui,Uj) — c(ui , Uj )) 2 dui duj < c 

cSS K (2,L) 4[0,1] 2 


4 logn J. 
nh 2 n 8 


Proof. As we apply mirror reflection kernel estimator in 4.7 the estimator’s behavior is different on 
the boundary area and central area of the unit square. We will analyze the properties in various 
areas. 


'[o,i ] 2 


(E Ch{iii, Uj) — c('Ui,Uj)) 2 dui duj 


= + + (E c h ( Ui , Uj ) - c(ui,uj)) 2 dm duj =: T A + T B + T c , 

Ja Jb Jc 

where A is central area, B is the four corners and C is four margins defined as follows: 


A = [h, 1 — h] 2 

B = [0, h ] 2 U [1 — h, 1] x [0, h\ U [0, h\ x [1 — h, 1] U [1 — h, l] 2 
C = [0,1] 2 \(AU6) 


We substitute Ui = F^\xi), Uj = F^\xj) for Uj, Uj in 4.7 , we will get the standard kernel 


density estimator of copula density and we denote it as c^ td («j, u 


F.1.1 Analysis of Central Area Ta 

For any ( Ui,Uj ) £ A = [h, 1 — /i] 2 , the mirror reflection kernel estimator is 


C h (Ui,Uj) 


1 ts ( Ui ~ Uik A 
nh?^ K \ h ) 

k= 1 v 7 



According the formulation of Ch(ui, Uj), we use the variance-bias decomposition and obtain that 


T a < 2 


IA 


EP h - E? h td j 2 dui duj + 2 J (ETf 1 - c) 2 du { d Uj =: 2{T^ ] + rf } ), 


4.7 


The 


where c| td is achieved by substituting Xi = F^ l \xi), Xj = F^\xj) for 
second term is integr ated square bias for standard kernel density, we have already known that 
su P P eE( 2 ,L) T^ < ch 4 Silverman 1986 . To estimate T^\ denote Kh(-) = hr l K(-/h) and we plug 
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in the constructions of copula estimators and have 


(E Ch — Ec| td ) 2 dui duj < 


A 


[Oh] 2 


2 \n 


dui duj 


[o,i ] 2 


1 n 

- \K h (ui - F^\x ik ))K( Uj - F^(x jk )) 

k =1 

- K h (ui - F^{x ik ))K h ( Uj - 
1 n 

- I K{ui ~ h~ l F$ {x ik j)K[ Uj - h-'FPixjk)) 

k =1 

- K( Ui - h- l F^{x ik ))K{ Uj - h - 1 


2\?l 


dui duj , 


where the last equality is because of change of variable. We define A k (xi,Xj) = max{| Fn\x{ k ) — 
Fn\xi k ) |, | Fn\xj k ) — Fn\xj k )\] and C k = 4||A’|| 00 Lfc. Due to the Lipschitz continuity of K(-) in 
Assumption ! 4.41 and the above inequality, we have 

T^ ] < ^y^2[|AT[[ 0O E (Lkj^ 1{A k (xi,Xj) < e} + 1{A k {xi,Xj) > e}^ 

< C k + F | max A k(xi, xj) > e < 2C\ (j^ + e" 4 "^ , (F.l) 


where the last inequality is derived by Dvoretzky-Kiefer-Wolfowitz inequality 1 Dvoretzky et al. 
1956 that P (supj-gjg \F n (x) — F(x)\ > e) < e“ 2ne2 for any distribution function F(-). 


Let e = yj 21 ° sn , we have < 2Ck(J og n/(nh 2 ) + n 8 ) and TJ < c[h 4 + log n/(nh 2 ) + n 8 ]. 

Similar to the analysis of , we can bound the mean square error using the same e as 


T( 2 ( < 

'A - 


E (- El K ( u ~ ^\x ik )/h)K{v - F&(x jk )/h) 

[o,i] 2 vn 


k =1 


-K(u - F^(x ik )/h)K(v - F( j \x jk )/h ) |j dui duj 

( 1 n \ : 

-£211*1100 ^L K ^tA k (xi,Xj) < e + lA k (xi,Xj) > ej J 
71 k=i ' ' 


- 2C k ( + e 


—2 ne 2 


= 2C'fc(logn/(?rh 2 ) + n 


—4 \ 


(F.2) 


F.l.2 Analysis of Corner Area 7g 

Recall that B is the four-corner area of [0, l] 2 . In specfic, B = [0, h} 2 U [1 — h, 1] x [0, h\ U [0, h\ x 
[1 — h, 1] U [1 — h, l] 2 . Without loss of generality, we analyze the left down corner B\ = [0, h] 2 since 
by symmetry, there exist a constant C > 0 such that 

Tb= (E Ch(ui,Uj) — c(ui,Uj)) 2 duiduj < C / (E Ch(ui,Uj) — c(iii,Uj)) 2 duidiij. (F.3) 

■W JBi 
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We denote the bivariate kernel function in terms of the rank statistics as 


K? h (a,b) = K 


Ui — a 
h 


K 


u j " b 

h 


For ( Ui,Uj ) E B i, according to the mirror reflect estimator defined in 


4.7 


we have 


1 n 

c h {ui, Uj) = ^2 [ K 2 ' h (uik, Ujk) + K%’ h (-u ik , u jk ) + K%’ h (u ik , -u jk ) + K%’ h (-u ik , -u jk ) 

k =1 

Similarly to the analysis of Ta, we separate it into two parts: 

T Bl < 2 J (Ec h -E ') 2 dm d Uj + 2 J (e Tf 1 - c) 2 dm du 3 = 2{T^ + t£ ] ) . 


Similar to the analysis in 
we first study 


F.l 


we 


have 7g^ < 2Ck (log n h 2 n 1 + n 8 ). In order to bound 


B i 


E^ d (uj,uj) 


1 /•! r 


h 2 



K 2 ' h (s,t) + K 2 ,h \—s,t) + K 2 ' h (s, — t ) + K 2 ' h \—s, — t) c(s,t)dsdt 


o J o L 
1 /•! 


r-1 /•! 


K(s)K(t)p(m + sh, Uj + th)dsdt + 


/i h 

r* 1 /*! 


_1 /Hi 

/i /i 


I\(s)K(t)c(sh — m,Uj + th)dsdt 


j J K{s)K(t)c{sh + m,—Uj + th)dsdt + J u j K(s)K(t)c(—m + sh,—Uj +th)dsdt 


+ 


By Assumption 4.4 J\ (•) is symmetric on [—1,1]. We obtain 

d d d c-^F 


h h 

r r 

h J h 
l2 


K(s)K(t)dsdt = 




cl 


K{s)K(t)dsdt = 


K(s)K(t)dsdt, 


K{s)K(t)dsdt. 


I -1 


Hence for (n*, ) E [0, /i] , we decompose the copula density function into four parts 


c(m,Uj) = 


" i' ■■ 


Uj)K{s)K(t)dsdt. 


h h h h h h h h 

Since c E £ K (2, A), for 0 < m,Uj < h, and — 1 < s, t < 1, we have 

| c{m + sh, Uj + th ) — c(m, Uj )| < ALh 2 , \c(sh — u, Uj + th) — c(iii, Uj)\ < 20 Lh 2 , 

| c(sh + u, th — Uj) — c(m, Uj) | < 20 Lh 2 , \c(sh — u,th — Uj) — c(m,Uj)\ < 36 Lh 2 . 

Therefore, for any point on the left down corner ( m,Uj) E B\, we can then bound the bias term as 
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Ec| td (uj, Uj) - c(v,i , Uj) 


*3 

r-1 rl 


h h 

-1 rl 


Z3_ / 
h h 


Ec^ d (ui,Uj) — j J K(s)K(t)c(s,t)dsdt 

/ / K(s)K(t)\c(ui + sh, Uj + th) — c(uj, Uj)\dsdt 

h h 

+ / / K(s)K(t)\c(sh — Ui,Uj+th) — c(ui,Uj)\dsdt 

J-^Z J ^ 

K (s) K (t)\c(sh + Uj, th — Uj) — c(uj, Uj)\dsdt 
+ J u J K(s)K(t)\c(sh — Uj,th — Uj) — c(ui,Uj)\dsdt < 80Lh 2 . 


Combining the above inequality with the 7~jj term and F.3 , we have the rate Tg < c[h 6 + 
log n/(nh 2 ) + n -8 ]. 

F.1.3 Analysis for Margin Area 7c 

Similar to the structure of B, C has fou r con nected components. Let C\ = [0, h\ x [h, 1 — h\ be the 
left one of the four margins, similar to F.3 , there exists some constant C > 0 such that 


7 C= (^Ch(ui,Uj) — c(uj,Uj)) 2 duj duj < C / (E Ch{uj,Uj) — c(iii,Uj)) 2 dujduj. 

Jc JCi 


For ( Ui,Uj) E Ci, by the definition of mirror reflect estimator in 4.7 , we have 

Uj Ujk 


1 


c h(ui, Uj ) = 

k= 1 L 


K | Uj Ujk 'j K I Uj Ujk 'j + K I Uj + Ujk j K 


h 


h J \ h J \ h 

And we still do the partition as previous into the bias and variance: 

Tc < 2 J (Ec h - E^) 2 dUl duj + 2 J (Ed% d - c) 2 d Ui d Uj =: 2(7^ (1) + T^). 


Similar to the proof of F.l , we can also bound the first term as 7 g 1 ^ < clog n/{nh 2 ) + n 4 . 


For the second term, we plug in the estimator and calculate the expectation directly as 
Ec^ td (uj, Uj) =f f K h (ui - s)K h (uj - t)c(s, t)dsdt + f f K h {uj + s)K h (uj - t)c(s , t)dsdt 

Jo Jo Jo Jo 

11 1 
=J J K(s)K(t)c(ui + sh,Uj + th)dsdt +J J K(s)K(t)c(ui — sh,Uj — th)dsdt. 

h 11 

As c E £ k (2, L) for 0 < u < h, by the definition of Holder class, we have 

| c(ui + sh, Uj + th) — c(uj, Uj) — (v c(ui, Uj), (s, t))h\ < L(s 2 + t 2 )h 2 , 

| c(ui — sh,Uj—th) - c(ui,Uj) + a 7 u ^]) (2 u -\- s h) + dc ^ U3 \ th)\ < L [(2 + s) 2 + t 2 ]h 2 . 
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For the Holder property as |s|, \t\ < 1 again, we have |c(tt* + sh, Uj +th ) — c(ui, uj)\ < 
dc(ui , Uj 


dc{ui , Uj 


dm 


h + 


OUn 


h+L(s 2 +t 2 )h 2 . Similarly, \p(ui — sh,Uj—th)—p(ui,Uj)\ < 9 


dp(x) 

h+ 

dp(x) 

du 


duj 


h+10Lh 2 . 


Ec% (ui,Uj) - c(ui,Uj] 
r i rt 


Hence we can bound the bias term on C\ 

Ec| td (rtj, Uj) — J J K(s)K(t)c(s,t)dsdt 
K(s)K(t)\c( Ui + sh, Uj + th) — c(ui, Uj)\dsdt 


*J- 

+ . 

< 10 


C£ 

dc(ui,Uj) 


du 


K(s)K(t)\c(—sh — u,Uj — th) — c(ui, Uj)\dsdt 

dc( Ui ,Uj) h + ULh 2 < l2Lh 2 + 12Lh 2 = 24 L/l 2 ) 


h + 2 


8Un 


where the last inequality is derived from 


dc{uj,uj) 

du 


dc(ui ,Uj ) 

dv 


< Lh, by the Holder condition and 


boundary assumption in Assumption 4.3 Therefore, we obtain 7 c < c[h 5 + logn/(n/i 2 ) + n 8 ]. 


In summary, combining the upper bound of Ta,Tb and 7c, we prove upper bound in Lemma F.l 


□ 


F.l.4 Bias of the Mutual Information Estimator 

Lemma F.2. Under Assumption |4.3| and |4.4] we have for some constant C\. the bias of mutual 
information estimator can be bounded by 


sup 

cGSk,(2,L) 


EJ(c ft )-/(c) 


<Ci 


logn\ 1//3 


n 


Proof. Let cf)(u) = ulogu and expand it by Taylor’s Theorem, denote A c(ui,Uj) = Ch(ui,Uj ) — 
c(ui, Uj) and we get 

(f) ( Ch(Ui , Uj)) - (j) ( c(Ui,Uj )) = (log(c(Uj, Uj)) + l) • A c(Ui, Uj) + ——^-r • A C 2 (Ui,Uj), 

\ u ii u j ) 


where f(ui,Uj) lies in between cff ui- Uj) and c(ui,Uj). According to the boundedness of copula 
density c(ui,Uj) and Ch, we have < £(m,Uj) < H2- 

Let k = max{| log«i|, | logK 2 |} + 1, applying the connection between the mutual information 
and copula entropy in [4.6|, defining cf(x) = xlogx, we have 


El (c h ) - 1(c) | UU |E H c (c h ) - H c (c )| =E [f (c h (ui,Uj)) - (f> (c(ui,uj))]dui du 


[o,i ] 2 


Fubini 


[ 0 ,lp 


E[(j> (ch(ui,Uj)) - 4>(c(ui,uj))]dui duj 
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Applying Taylor expansion to the last equality above, we have |E/ (c'h) — /(c) | < h + h, where 


h = 
h = 


[o,i ] 2 


(log(c(«i, Uj)) + l) • E[Ac (ui, Uj)]dui duj 
E[A c(ui, Uj)] 2 dui dujdui duj 


[ 0 , 1]2 2 £(ui,Uj) 


By Holder inequality, we have I\ < nJff^ E Ch(ui,Uj) — c(ut,Uj ) ( 


h< 1 // E 

J d [ 0 , 1]2 


C h (Ui,Uj) - (Ui,Uj 


1 2 


dui duj + 


E 


dui duj and 


Cj; d (Ui,Uj) - c(Ui,Uj) 


1 2 


dui duj 


d J [ 0 , 1]2 

Therefore, we can bound the bias of mutual information estimator by 

|EJ (c h ) — /(c)| < c\[h 2 + i/log nj (n 4 / 2 /r) + n” 4 ] + C 2 log n/{nh 2 ) + n ~ 8 + C 3 / 1 4 + C 4 /i _ 2 rc _1 . 


The above inequality follows from Lemma F.l and Lemma F.2 , where ci, 02 , 03,04 are three 
constants. We prove the upper bound by setting h x (log n/n) i/b . □ 

F.2 Variance Term Analysis 

We are going to show an exponential concentration inequality of our mutual information estimator. 
Lemma F.3. If we use the kernel function satisfies Assumption! 4.4| we have 


e SnP L )P(|/(^)-E//( 4 )| > e) < 2 exp 6 j ’ 

where k = max{ | log |, | log K 2 |} + 1 . 

Proof. In order to prove the concentration inequality, we apply the McDiamaid’s inequality 


(F.4) 


arrnid 


1989 


McDi- 


. To use the inequality, we need to bound the change of the esti mato r if we disturb one 
sample point. Let c- h (ui,Uj) be the kernel density estimator defined as in 4.7 by replacing 7-th 
data point (x^, Xjk) with an arb itrary value ((x^)', ( Xjk )')■ As (ft 1 (u) = logu + 1, according to the 
boundedness in Assumption 4.3 we have max [|<//(c) l (u))|, \(ft' ( 0 ^( 11 )) |] < k. 

When ( x^, Xjk ) is replaced by ((x^)', (xjk )'), the ( Uik,Ujk ) will correspondingly move to 
((uik)\ (ujk)') where 


u ir , 


= 


Uim if Uj m fL (u ik A (u ik )', u ik V (u ik )'); 

u im + 1 /n if u ik > (ujk)' and u ik G (u ik )', u ik ) ; 
u im - 1 /n if u ik < (ujk)' and u ik G (u ik , (u ik )' 


for any m k. Similar behaviour will happen to ( Uj m )' as well, so we have for any m k, 
(^im) | ^ 1 /n and \ujm (d^jm) I — 1 /u. 
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For simplicity, we denote K(m)K(uj) as K 2 (u ) and u k ■= ( Ui k ,Uj k ), (u k )' '■= (( upf )', ( Uj k )')• For 
the simplicity of notation, in the following, the supremum sup is taking over xn,... ,Xi n , (xi k )' and 
Xj i,..., Xj n , (, Xjk)'■ With this we have 


sup | I(c h ) -I((? h ) | = sup | 


[ 0 , 1]2 


( c h (ui , itj)) - (/)@ h (ui, Uj))\dui duj I 


< K sup 


[ 0 , 1]2 


|(ttj, Uj) — c h {ui , Uj)|dui duj < k sup 


[0,1] 2 


itj) duj. 


By the definition of the kernel density estimator, we have 



where the last inequality is due to the change of variable. By the Lipschitz property of kernel, we 
can further bound the difference by 


!7Y~\ j-w m ^ l k ii- vll 2 , 8k 8kL k , 8k 16kL k 

SUp \I(c h ) - I(c h ) | < — 2^ Um ~ ( U m ) || 2 + — < -— + — < 


n z ' h 

m^k 


nh 


n 


nh 


Using McDiamaid’s inequality we have 


sup P(|/(c^) - EH(c h )\ > e) < 2 exp ( - 


cGS/c(2,Z/) 

and we prove the exponential concentration inequality 
the bandwidth in Lemma [F.21 


nh z e 


2_2 


128 k 2 L 2 k 


FA 


by setting h x (log n/n) 1 ^ same as 

□ 


G Proof of Theorem 4.10 


Proof. We define the Li-norm ||p||i := / \p(x)\dx and 

Jx 


PF* (x) = P 
U,k)eE 


Ph 2 {Xj,X k ) 

r Ph 2 {xj)p h2 (x k ) u&u 


n^w- n ph i(^)> 

iev\u* 


where U* C U is the set of isolated vertices in the oracle-forest F*. 

It is easy to see that pp*(x) > 0 and Jpp*(x)dx = 1. Let D(p\\q) be the KL-divergence between 
two densities p and q. Using Pinsker’s inequality, we have 

|| PF* Pf* ||i < mm{y/D(p F *\\p F *), y/D(p F * ||pf*)}> where (G.l) 
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D(pf* || Pf*) — 


/ \ i Pf* 0) , 

Pf* (x) log ~ / \ dx 

Pf*{x ) 

, ri(j,fc)es* p( x ji x k) r ri(j /c)gs* p{ x j)p( x k) 

Pf* (x) log ; -——- -dx - / pi?* (x) log ' 


ri(j,fc)eE* i x j)Ph 2 O^fe) 

+ £ /« log + JL _ / *><«> log in*,. 


ri(j,fc)eE* Pha [ x ji x k) 
P( x u ) 


u£U* 


Therefore, the expectation of the KL-divergence can be bounded by 

/ p(x ' Xfc\ /* p(x u ) 

p( x i, Xk) log ^— p———-dxjdxk + s- maxE / p(x u ) log —— F—dx u 

Ph2V^j')^k) u J Vh2\^u) 

f 'p(xp') 

+ (d — s) • maxE / p{xA log ——-—-dxp 
* J Ph i {X£) 

= s ■ maxED(p jk \\p jk ) + s • maxE D(p u \\p^' > ) + (d - s) • max¥.D(pg\\ffl), 

jk u l 

where p jk := p h2 (xj,x k ),pi^ ■= Ph 2 ( x u),ffi '—Ph^d- Similarly, we have 

E D(p F * || Pf*) < s ■ maxED(p jk \\pj k ) + s ■ maxEL>(p^||p M ) + (d - s) ■ maxET(^^||p^). (G.2) 

jk u (. 

Since K] < inf jkPjk, the KL-divergence can be further bounded by 

D{pjk\\Pjk) = j Ph 2 ( x j, x k ) log ^(x.^Xk) dxjdx k < J dx i dxk ~ 1 


{p h2 {xj,Xk)-p{xj,Xk)Y t . 1 n ~ l|2 

p(xj,x k ) dxjdxk < K J Pjk Pjk || 2 - 


(G.3) 


Using 


G.3 


, we know that there exists a constant Ci which does not depend on j, k, such that 

1 2 
sup E D(pj k \\p jk ) < — sup E\\p jk -Pjk\\l < C\n~3. 

Pj k & S k (2,L) k i Pjk ez K (2,L) 


Similarly, we control the terms KD(pp ] \\p u ) and max; D{pf^ \\p() and show that they have the rate 
C *2 • n ~ 2 / 3 and C 3 • n ~ 4 //5 res pecti vely. 

Therefore, by 


G.l 


and 


G.2 


we have for some constant C\ 


sup sup E||pi?* — pf* ||i < C 
FeJ-jpgS K (2,L) 


+ 


d — s 


n 2/3 n 4/5 > 

Using the fact that \\pp — pf* ||i < ||Pp||i + ||p.f*||i < 2, we have 

E l|Pj? -PF*\\i < E||p>. -p F * 111 + E||pp ~PF*\\i • l(F F*) 

ft C ■ \l „ + 


d — s 


n 


2/3 


n‘ 


4/5 


+ 2P (F ^ F* . 
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From 


D.l 


in the proof of Theorem 


3.6 


we have 


sup P (f / F*) = O (V (logn)1/3 ) . 


P&Vk 


The desired result complete the proof of theorem. 


□ 
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